Anomaly classification, analytics and resolution based on annotated event logs

ABSTRACT

Operational event logs and operational alarms produced within a running multi-server data processing system are automatically and repeatedly sampled and co-associated with one another so as to build annotated logs. The annotated logs can be used by post-process analytics for filling in mappings into an anomalies versus parameters mapping space and for keeping track of unusual changes in the mappings or their rates, where such unusual changes can be indicative of emerging new problems of significance within the system.

FIELD OF DISCLOSURE

The present disclosure relates generally to multi-device systems such as used in integrated client-server/internet/cloud computing environments where plural physical and virtual data processing machines and/or other resource consuming constructs are disposed in respective sections of an interconnected fabric of client devices (e.g., smartphones), servers (real and virtual), communication resources (e.g., wired and wireless), data storage resources (e.g., databases) and so on for carrying out desired data processing and data communicating operations. The disclosure relates more specifically to machine-implemented methods for automatically determining what constitutes emerging anomalous behavior of significance in such multi-device systems and for automatically providing machine-implemented adaptive classification of anomalies and proactive resolutions.

DESCRIPTION OF RELATED TECHNOLOGY

In large-scale multi-device systems such as those using “cloud computing” (e.g., cloud based servicing of requests received from large numbers of mobile and/or stationary client machines), many things can go wrong. Communication channels may break down or experience excessive interference. Data storage units may begin to exhibit unacceptable latencies or difficulties in reading and/or writing desired data portions. Power supplies and/or their fans may fail or, worse yet, slowly begin to intermittently degrade. Magnetic or other kinds of disk drive systems may crash or, worse yet, slowly begin to intermittently degrade. Electrical interconnects may develop intermittent opens or shorts that slowly become more frequent over time. DRAM memory chips may experience unusually large numbers of soft errors. Software program operations may go awry. These are merely illustrative examples.

Operations management teams who manage day to day operations of such large-scale multi-device systems (e.g., cloud based systems) often wish to proactively get ahead of emerging problems and nip them in the bud so that the latter do not become catastrophic system failures. When a catastrophic system crash occurs, commercial and/or other system users may experience an inability to use mission critical hardware and/or software. Examples of mission critical system users include hospitals and/or other medical service providing institutions, banks and/or other financial service providing institutions, police and/or other security service providing organizations and so on. Needless to say, system crashes for such entities may have disastrous consequences.

Given the severity of consequences in many failure scenarios, it is desirable to develop automated analytics systems that automatically learn to distinguish between cases where normal or routine anomalies of the day to day system operations kind are occurring and cases where less routine but significant anomalies begin to emerge within the noise background of the insignificant, normal anomalies of the day to day kind. System management teams should be automatically alarmed when truly significant anomalies begin to appear, as opposed to being alarmed for every one of the routine day to day kinds of anomalies. Too high a rate of alarms for insignificant routine problems can interfere with efficient operation of the large-scale multi-device system. More specifically, false alarms and/or alarms for insignificant events can drive up operational costs, exhaust operational personnel and render them insensitive to alarmed situations where there actually is a truly significant problem that is emerging and must be quickly taken care of. This can be considered a classification problem.

The question presented is how to form an automated system that adaptively learns to distinguish between “truly” significant ones of emerging problems and those that are routine events within the normal day-to-day operations of the system. In the past, operators relied on historical performance pictures (performance snapshots), regression analysis (e.g., determining what is “normal” or average based on past performances) and then detecting supposedly-significant deviations from the historical normals (from the regression-produced, “normal” curves).

There are several problems with such a regression analysis and deviation detect approach. First, it is not definitively known, and thus it is primarily guess work, as to what should be the observed driving and driven variable(s) of a regression analysis. Should hour of the day be a driving factor? Should it be day of the week? Should it be number of logged-in users, or combinations of these and/or other possible driving variables? Then of course there is also the question of what the driven variable(s) of the regression analysis should be. In other words, is there a true cause and effect relationship between selected driving and correspondingly selected driven factors? Possible, but not limiting examples of options for driven factors include CPU utilization percentage or absolute rates, DRAM utilization percentages/rates, disk drive utilization percentages/rates, I/O utilization, power consumption, and so on. Then, for the regression analysis itself, there are many possible algorithms to pick among, including, but not limited to, linear regression, parabolic regression, piece-wise linear regression, piece-wise parabolic regression, higher ordered continuous and/or piece-wise power series regression formulas or mixes thereof. Additionally, operators may arbitrarily choose to use merely a single driven and a single driving variable, or they may assume plural driving factors for a single driven variable, or alternatively multiple driven and driving variables. They may further choose different widths and sampling rates for their regression analysis windows (e.g., as taken over what length of time, at what sampling rate, etc.?). With all of these, it is not definitively known what to pick, and thus it is primarily guess work (often falsely justified as being “educated” guess work). It is to be understood that the description given here does not mean that any part or all of this was recognized heretofore.

After specific ones among an astronomically large range of possible regression methods are picked for use with selected driven/driving variables, and after operators have produced a supposedly “normal” behavior curve (or curves, or N-dimensional “normal behavior” surfaces), the question still remains as to what is the amount of deviation and/or what are the number of times that such deviation(s) need to be present in order to declare the corresponding event(s) as truly significant anomalies that are worthy of having follow up work conducted for them. The follow up work may include identifying the alleged root cause(s) for the declared-as-significant anomaly and changing the system so as to supposedly “fix” the root cause(s) without creating additional problems.

As indicated above, it is to be understood that this background of the technology section is intended to provide useful background for understanding the here disclosed technology and, as such, this technology background section may include ideas, concepts or recognitions that were not part of what was known or appreciated by those skilled in the pertinent art prior to corresponding invention dates of subject matter disclosed herein. In particular, it is believed that prior art artisans did not appreciate wholly, or at least in part, all of the problems associated with reliance on the regression analysis and deviation detect approach. Moreover, it is believed that prior art artisans did not appreciate wholly, or at least in part, that there are other options to pursue.

SUMMARY

Structures and methods may be provided in accordance with the present disclosure for providing a more knowledgeable kind of machine automated, adaptive learning for distinguishing between significant ones of emerging anomalies in system behavior that are worthy of specially alarming for and those that are merely routine anomalies.

More specifically, in accordance with one aspect of the present disclosure, a machine-implemented method is provided for keeping track, in an anomalies versus parameters mapping space, of previously identified and emerging anomalies of a data processing system, where the method comprises: running a first section of the data processing system where the first section includes a section alarming subsystem and a section behaviors logging subsystem, the section alarming subsystem being configured to generate alarms for alarm-worthy events within the first section, the section behaviors logging subsystem being configured to generate a log of monitored behaviors within the first section; logically co-associating recently logged behaviors of the generated log produced by the section behaviors logging subsystem with substantially cotemporaneous alarms generated by the section alarming subsystem and recording the logical associations; building an annotated log comprised of the logically co-associated logged behaviors and substantially cotemporaneous alarms; using the annotated log to keep track, in a corresponding first anomalies versus parameters mapping space, of previously identified as routine and emerging anomalies of the first section of the data processing system; and automatically repeating said co-associating, building and using steps while the first section of the data processing system continues to run.

Other aspects of the disclosure will become apparent from the below detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The below detailed description section makes reference to the accompanying drawings, in which:

FIG. 1 is a block diagram schematically showing an interconnected multi-device system having embedded alarming and logging subsystems;

FIG. 2 is a block diagram schematically showing an interconnected multi-device system having embedded alarming and logging subsystems where post-process analytics blocks adaptively develop respective anomalies classification systems based on continuously updated and annotated behavior logs;

FIG. 3 is a flow chart depicting an automated process for building respective and continuously updated, annotated behavior logs for respective sections of the interconnected multi-device system of FIG. 2;

FIG. 4A is a first Venn-like diagram illustrating how to map normal or routine day-to-day anomalies based on their occurrences within a multi-parameter space for respective sections of a large system and how to spot emergence of multi-sectional anomalies;

FIG. 4B is a second Venn-like diagram similar to that of FIG. 4A and further illustrating how to map histogramic behaviors in various regions of the multi-parameter space;

FIG. 5 is a schematic diagram depicting an over-time build up and updating of annotated logs for use in anomaly space mapping, histogramic behavior mapping and emerging behavior analysis; and

FIG. 6 is a flow chart depicting an automated process for using the results of the annotated logs to proactively resolve emerging problems of significance.

DETAILED DESCRIPTION

FIG. 1 is a block diagram showing an integrated client-server/internet/cloud system 100 (or more generically, an integrated multi-device system 100) to which the here disclosed technology may be applied. The illustrated system 100 is merely exemplary and comprises one or more client devices 110 (only one shown in the form of a wireless smartphone, but understood to represent many and not only of the smartphone client kind); one or more wired and/or wireless communication fabrics 115 (only one shown in the form of a wireless bidirectional interconnect) coupling the client(s) 110 to networked servers 120 (not explicitly shown), where the latter may operatively couple by way of further wired and/or wireless communication fabrics 125 (not explicitly shown) to further networked servers 130 (not explicitly shown). The second set of networked servers 130 is depicted as a “cloud” 130 for the purpose of indicating a nebulous and constantly shifting, evolving set of hardware, firmware and software resources. As those skilled in the art of cloud computing will appreciate, the “cloud” 130 may be implemented as reconfigurable virtual servers and virtual software modules implemented across a relatively seamless web of physical servers, storage units (including flash BIOS units), communication units and the like such that failure of specific units within the physical layer is overcome by shifting the supported virtual resources to other, spare support areas in the physical layer. Because of sheer size, and also because of this constantly shifting and self-reconfiguring fabric of resources, it can be very difficult to spot emerging problems of significance.

A quick and introductory walk through FIG. 1 is first provided here so that readers may appreciate the bird's eye lay of the land, so to speak. Item 111 represents a first user-activatable software application (first mobile app) that may be launched from within the exemplary mobile client 110 (e.g., a smartphone, but it could instead be a tablet, a laptop, or a wearable computing device such as a smartwatch). Item 113 represents a second such user-activatable software application (second mobile app), and generally there are many more. Each client-based application (e.g., 111, 113) can come in the form of nontransiently recorded digital code (i.e., object code or source code) that is defined and stored in a memory for instructing a target class of data processing units to perform in accordance with client-side defined application programs (‘mobile apps’ for short) as well as to cooperate with server side applications implemented on the other side of communications link 115.

One example of a first mobile app (e.g., 111) could be one that has been designed to service a particular business organization (e.g., Book Store #1) in accordance with how that particular business organization chooses to organize itself. For example, if a user (not shown) of the mobile client 110 wants to browse through a collection of new books offered by the business organization and perhaps buy some, the user may first be asked to download the first mobile app (e.g., 111) into his/her client device 110. This will typically involve a download of app code from the Internet 120, through a wireless portion of the communications link 115, and an operative coupling (“installation”) of the downloaded code with client-side operating system code (OS) that typically has also been downloaded via link 115 into the client device 110. Next, the user activates a book-store browse feature of the first app 111 and it causes service requests to go out through link 115 to targeted modules and/or servers within the Internet portion 120. Those targeted modules and/or servers may offload (delegate) parts of their data processing, storing and/or routing tasks to yet further resources within the “cloud” 130 by way of the illustrated, second communications link 125. (In actuality, the “cloud” 130 may be embedded or enmeshed within Internet 120 and the first and second communications links 115, 125 may be inseparably integrated one with the other. They are shown separately for the purpose of depicting how tasks may be delegated out over various resource and communications portions of the overall system 100.)

If all the different parts are operating as desired, the cloud-based resources (130) will timely and properly perform their delegated tasks and timely return results to the task delegators (e.g., in 120), and the latter will then timely return appropriate results to the client hardware and software of mobile device 110, whereby the user is able to browse the new books, buy desired ones and be charged appropriately for them. By timely, it is often meant (depending on the task at hand) that the user experiences a request-to-results latency time of no more than a second or two. However, delegated flows of data processing, storing and/or communication tasks may go awry due to congestions, interferences, and intermittent and creeping-wise growing anomalies anywhere within the complex system 100. Also, at various times, “updates” are installed into various ones of the reconfigurable resources of the system 100 and such updates may introduce unexpected and sometimes late-blooming problems into the system.

In order to deal in an orderly way with the massive size and complexity of the system 100, it is subdivided into management-defined “sections”. The size and contents of each section are left to the managers of the system, but generally each section (where 140 and 160 are two examples of such subdivisions) will include a limited number of intercoupled, “local” resources such as one or more local data processing units (e.g., CPU 141), one or more local data storage units (e.g., RAM 142, ROM 143, Disk 146), one or more local data communication units (e.g., COMM unit 147), and a local backbone (e.g., local bus 145) that operatively couples them together as well as optionally coupling them to yet further ones of local resources 148. The other local resources 148 may include, but are not limited to, specialized high speed graphics processing units (GPU's, not shown), specialized high speed digital signal processing units (DSPU's, not shown), custom programmable logic units (e.g., FPGA's, not shown), analog-to-digital interface units (A/D/A units, not shown), parallel data processing units (e.g., SIMD's, MIMD's, not shown) and so on.

It is to be understood that various ones of the merely exemplary and illustrated, “local” resource units (e.g., 141-148) may include or may be differentiated into more refined kinds. For example, the local CPU's (only one shown as 141) may include single core, multicore and integrated-with-GPU kinds. The local storage units (e.g., 142, 143, 146) may include high speed SRAM and DRAM kinds as well as kinds configured for reprogrammable, nonvolatile solid state storage and/or magnetic and/or other phase change kinds. The local communication-implementing units (only one shown as 147) may operatively couple to various external data communicating links such as serial, parallel, optical, wired or wireless kinds typically operating in accordance with various ones of predetermined communication protocols. Similarly, the other local resources (only one shown as 148) may operatively couple to various external electromagnetic or other linkages 148a and typically operate in accordance with various ones of predetermined operating protocols.

The expected “normal” behaviors for the various local resources 141-148 of the given, local section 140 are defined by the system managers of local section 140. What is considered as “normal” behaviors in one local section (e.g., 140) may be substantially different from what is considered as “normal” behaviors in another local section (e.g., 160). For example, local section 140 is depicted as being inside Internet 120 (perhaps functioning as a web server inside 120) while local section 160 is depicted as being inside “cloud” 130 (perhaps functioning as a virtual machines implementing unit inside 130). It is to be understood that the descriptions for the various resources within local section 160 mirror those already provided for the various resources within local section 140 and that the reference numbers correspond (e.g., CPU 161 corresponds to CPU 141, RAM 162 corresponds to RAM 142, etc.).

In one embodiment, the expected and respective “normal” behaviors for the respective and various local resources, 141-148, 161-168, etc. of the given, local sections 140, 160, etc. are defined by the respective local system resource managers as knowledge base expert rules, stored and executed by respective local alarm generating subsystems 151 (of section 140), 171 (of section 160), and so on. An example of a local alarm generating rule might be: IF CPU clock speed <500 MHz THEN output Alarm_number_CPULT500 ELSE IF CPU clock speed >3.8 GHz THEN output Alarm_number_CPUGT38. Another example might be: IF RAM Free Space <200 MB THEN output Alarm_number_RAMFLT200 ELSE IF RAM Free Space >100 GB THEN output Alarm_number_RAMFGT100. Each local alarm generating subsystem (e.g., 151, 171) is operatively coupled to its local section backbone (e.g., 145, 165) for acquiring in real time various performance indicating signals such as CPU utilization indicators, storage utilization indicators, communication resources utilization indicators, and other resource utilization indicators for the respective local resources (e.g., 141-148, 161-168). Typically, the internal operations (e.g., local alarm generating knowledge base rules) of the local alarm generating subsystems (e.g., 151, 171) are unknown to the rest of the system and thus these subsystems appear as black box modules that receive inputs and then decide in black box manner whether to output alarms, and if so, what kind.
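
To make the flavor of such rules concrete, the following is a minimal Python sketch of a threshold-style rule evaluator patterned on the two examples above. It is an illustration only: the field names, units and record layout are assumptions of this sketch and not the hidden logic of any actual alarm generating subsystem (e.g., 151, 171).

# Hypothetical, illustrative transliteration of the threshold-style alarm
# rules given above; field names, units and limits are assumptions.
def evaluate_alarm_rules(sample):
    alarms = []
    if sample["cpu_clock_ghz"] < 0.5:          # CPU clock speed < 500 MHz
        alarms.append("Alarm_number_CPULT500")
    elif sample["cpu_clock_ghz"] > 3.8:        # CPU clock speed > 3.8 GHz
        alarms.append("Alarm_number_CPUGT38")
    if sample["ram_free_mb"] < 200:            # RAM Free Space < 200 MB
        alarms.append("Alarm_number_RAMFLT200")
    elif sample["ram_free_mb"] > 100_000:      # RAM Free Space > 100 GB (in MB)
        alarms.append("Alarm_number_RAMFGT100")
    return alarms

print(evaluate_alarm_rules({"cpu_clock_ghz": 0.4, "ram_free_mb": 150}))
# -> ['Alarm_number_CPULT500', 'Alarm_number_RAMFLT200']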

In one embodiment, each local section (e.g., 140, 160) includes a respective, real-time resources management unit (e.g., 152, 172) coupled to receive the generated alarms output by its respective local alarm generating subsystem (e.g., 151, 171). The real-time resources management unit (e.g., 152, 172) is configured to respond in local management defined appropriate ways to the locally generated alarms. The local management defined appropriate ways to respond might include doing nothing, or simply counting how many times a certain kind of local alarm is output and/or the rate at which it is output. An example of a local alarm processing rule might be: IF Alarm_number_RAMFGT100 THEN Increment Excess Free Space Count by 1; IF Excess Free Space Count >100 THEN Output Alarm_number_RAMFCGT101 and Reset Excess Free Space Count; IF Alarm_number_RAMFGT100 AND Time of last Alarm_number_RAMFGT100 <200 ms THEN Output Alarm_number_RAMFCRLT200 ELSE Return.
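
The counting and rate-based processing of such a rule can likewise be sketched as a small stateful object. This is a hedged illustration of the Alarm_number_RAMFGT100 example above, with the >100 count and <200 ms spacing carried over from that example and everything else assumed:

import time

class ExcessFreeSpaceRule:
    # Hypothetical stateful mirror of the counting rule above; the counter
    # threshold (>100) and inter-alarm spacing (<200 ms) come from the text.
    def __init__(self):
        self.count = 0
        self.last_seen = None

    def process(self, alarm, now=None):
        now = time.time() if now is None else now
        outputs = []
        if alarm == "Alarm_number_RAMFGT100":
            self.count += 1                   # Increment Excess Free Space Count
            if self.count > 100:
                outputs.append("Alarm_number_RAMFCGT101")
                self.count = 0                # Reset Excess Free Space Count
            if self.last_seen is not None and (now - self.last_seen) < 0.200:
                outputs.append("Alarm_number_RAMFCRLT200")
            self.last_seen = now
        return outputs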

Other responses of the respective, real-time resources management units (e.g., 152, 172) might be to implement short term fixes (154, 174) such as: IF Alarm_number_RAMFCRLT200=True_for_2RAMs THEN Move Data in First Underutilized RAM to Second Underutilized RAM AND ReMap RAM Address Space AND Place First Underutilized RAM into Low_Power_Standby_Mode. In other words, in this last example, the short term or quick fix (154) is to detect two underutilized resources of a same kind (it could have been DISK or COMM instead of RAM) and move utilization load into one of them so that the full utilization span of the other is made available and so that, for some cases, power consumption is reduced. However, this kind of short term fix does not address the underlying cause. Why did the alarmed situation arise in the first place, and are there any long term modifications to be made to the system 100 so as to proactively avoid the alarmed situation to begin with? Again, it is to be noted that typically, the internal operations (e.g., local alarm generating knowledge base rules) of the local alarm generating subsystems (e.g., 151, 171) are unknown to the rest of the system and thus appear as black box modules. Moreover, it is to be understood that the internal operation settings of the local alarm generating subsystems (e.g., 151, 171) are not static and can change from time to time. The respective structure of each system section (e.g., 140, 160) is also not static and can change from time to time. For example, system managers may occasionally decide to increase or decrease the amount of volatile memory (e.g., 142/146; 162/166) present in specific ones of the system sections.

While the local real-time resource management units (e.g., 152, 172) provide short term fixes (154, 174) in response to some of the local alarms, generally, the detection and resolution of long term problems is left to a non-real-time, post-process analytics subsystem (e.g., 157, 177) of the respective local section (e.g., 140, 160). In other words, the real-time resources management units (e.g., 152, 172) are dedicated to quickly detecting problems of the short term kind and patching them up with whatever short term fix (e.g., 154, 174) seems most expeditious so that the section remains operational and so that the respective real-time resources management unit (e.g., 152, 172) can move on to detecting and resolving the next alarmed, real time situation. In contrast, the post-process analytics subsystem (e.g., 157, 177) is given time to go over historical records (e.g., 156, 176) and to apply time-consuming analytics to them so as to spot long term trends and come up with long term solutions (e.g., 158, 178) to spotted ones of the long term problems. To this end, an events logging subsystem (e.g., 155, 175) is provided in each local section (e.g., 140, 160) and operatively coupled to the local resources of that section for recording, into a respective, local section performance log (e.g., 156, 176), section performance values for each local section event. The section performance values may include various resource utilization indicators for the respective local resources (e.g., 141-148, 161-168) such as CPU utilization indicators, storage utilization indicators, communication resources utilization indicators, and other utilization indicators deemed appropriate for substantively reporting the state of the respective local section at the time of the logged event.

Events that trigger performance logging thereof into the respective, local section performance log (e.g., 156, 176) may vary from section to section. A common type of event is a periodic status recording event where the local section state is periodically recorded into the local section performance log (e.g., 156, 176), say every 100 milliseconds (ms). Another type of event that may be often logged is successful completion of a task assigned to the local section. Yet another type of event often logged may be an unsuccessful ending of a task assigned to the local section, for example when the task is terminated due to an error. Task terminations due to error do not necessarily mean that a corresponding alarm will be output by the local section alarming subsystem (e.g., 151, 171). The signal inputs (151a, 171a) to the alarming subsystem (e.g., 151, 171) are not necessarily all the same as those (155a, 175a) of the events logging subsystem (e.g., 155, 175) of the respective section. The timings of the respective result outputs (e.g., of alarms, of event log records) of the local section alarming subsystem (e.g., 151, 171) and of the events logging subsystem (e.g., 155, 175) are also not necessarily the same. Each has its own assigned job (alarm generation and event logging, respectively) and generally performs it independently of the other.

As indicated above, typically the post-process analytics subsystem (e.g., 157, 177) of each respective section does not have access to the internal logic (e.g., expert knowledge base logic) of the corresponding section alarming subsystem (e.g., 151, 171) and vice versa. That means the post-process analytics is generally performed without benefit of the knowledge base logic embedded in the corresponding section alarming subsystem.

Referring to FIG. 2, shown is a system 200 in accordance with the present disclosure where the post-process analytics subsystem (e.g., 257, 277) of each respective section (e.g., 240, 260) is given a way of gaining access (at least in an indirect way) to the knowledge base logic embedded in the corresponding section alarming subsystem (e.g., 151′, 171′). For sake of brevity, many of the reference numbers used in FIG. 1 are repeated as primed ones in FIG. 2 or changed from the 100 century series to the 200 century series. For example, the mobile client device 110 of FIG. 1 becomes mobile client device 110′ of FIG. 2, CPU 141 of FIG. 1 becomes CPU 141′ of FIG. 2, and so on. Thus, these do not need to be described again.

The internal structures of the local sections of FIG. 2 are changed, and thus what was referred to in FIG. 1 as local section 140 becomes section 240 in FIG. 2. Similarly, section 160 becomes section 260 in FIG. 2. A major change in each of the sections (e.g., 240, 260) is that the respective post-process analytics block (257, 277) uses a so-called “annotated log” (e.g., 256, 276) for performing its respective local section analytics in place of using merely the “raw” event time versus performance log (156′, 176′) output by the respective events logging subsystem (e.g., 155′, 175′). By annotating the “raw” event time versus performance log (e.g., 156′, 176′), the system 200 gains access (at least in an indirect way) to the knowledge base logic embedded in the corresponding section alarming subsystem (e.g., 151′, 171′) and thus can perform post-process analytics (e.g., 257, 277) on a more knowledgeable, and thus improved, basis.

In one embodiment, the local “annotated log” (for example, log 256 of system section 240) is generated by concatenating to each recorded one of the events in raw performance log 156′ either an indication that no alarms 250 were output in a corresponding time slot by the section alarm generator 151′ (e.g., by indicating Alarm(s)_current=: FALSE inside the annotated log) or an indication that one or more alarms 250 were output (e.g., by indicating Alarm(s)_current=: TRUE), optionally together with an identification of the number of such alarms and their types (e.g., by indicating Alarm(s)_current_of_Type1=: 1, Alarm(s)_current_of_Type2=: 3, Alarm(s)_current_of_Type3=: 0, etc.).
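
A minimal sketch of this concatenation step follows, assuming each raw event record is available as a dictionary and that the alarms for its time slot have already been gathered. The field names echo the Alarm(s)_current notation above, but the record layout is an assumption of the sketch:

# Hypothetical concatenation of alarm indications onto one raw event record.
def annotate_record(raw_event, alarms_in_slot,
                    known_types=("Type1", "Type2", "Type3")):
    annotated = dict(raw_event)                   # copy of the raw log record
    annotated["Alarms_current"] = bool(alarms_in_slot)
    for alarm_type in known_types:                # per-type counts, as in the text
        annotated["Alarms_current_of_" + alarm_type] = sum(
            1 for a in alarms_in_slot if a["type"] == alarm_type)
    return annotated

print(annotate_record({"time": "9:14:47", "cpu_util": 0.62},
                      [{"type": "Type2"}, {"type": "Type2"}, {"type": "Type2"}]))
# -> {'time': '9:14:47', 'cpu_util': 0.62, 'Alarms_current': True,
#     'Alarms_current_of_Type1': 0, 'Alarms_current_of_Type2': 3, ...}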

Referring briefly to FIG. 5, shown is an example of a two-part annotated log 556/557 where portion 556 alone constitutes a raw events log while the co-aligned portion 557 constitutes a concatenated-on, alarms-indicating section. In this particular example (which will be further detailed below), a time slot matching module 551 organizes the records of portions 556 and 557 so that the logged event time 556a of each raw event log record is substantially within a same time slot as the alarm time 557a of a corresponding one or more of the alarms (if any). Columns 557b, 557c, 557d, etc. are filled with respective indications of whether occurrences of specific types of alarms are True or False and optionally filled with details about the alarm types (e.g., how many? where located within the section? etc.).

Referring back to FIG. 2, records of the annotated log 256 are provided to an annotation-based post-process analytics module 257. It will become apparent that new records (more current records) are constantly being added to the annotated log 256 as newer events are recorded by the events logging subsystem 155′ and as newer alarms 250 are output by the section alarm generator 151′. The post-process analytics module 257 automatically and repeatedly accesses the newer records of the annotated log 256 as new sample points and automatically, repeatedly updates various analytics models (not all shown) that it generates to represent its current understanding of the status and trajectory of its watched system section 240.

One of the analytics models that the post-process analytics module 257 creates and automatically, repeatedly updates is a simulated version 261 of the section's alarm generator 151′. As mentioned above, typically the inner workings of the section's alarm generator 151′ are unknown to, and thus present as a black box to, the post-process analytics module 257. However, with aid of the annotated log 256, the post-process analytics module 257 can start developing its own mapping of a parameters space so as to identify regions in parameter space that typically result in alarm generation and those that typically do not. Then, using this over-time, painted-in mapping of parameter space (e.g., filling in with sample points), the post-process analytics module 257 can extrapolate towards identifying, in a broader sense, which regions in parameter space typically result in alarm generation, which typically do not, and what types of alarms occur in each region and at what frequency of occurrence. This extrapolated mapping becomes the foundation of the simulated version 261 of the section's alarm generator 151′.

Referring to FIG. 4A, shown is a simplified Venn-diagram like mapping of a two dimensional (2D) parameter space 400. In practice, parameter space 400 is many-dimensional (e.g., 3D, 4D, etc.) and may include time and location axes. For sake of simplicity, a 2D version is illustrated. Vertical axis line 410 represents a first operational parameter, such as, for example, relative CPU utilization as measured on a normalized scale extending from 0% to a maximum value of 100%. In an alternate embodiment, the Param-1 axis 410 may represent CPU utilization in absolute terms, such as, for example, extending from 0 executed instructions per second (EI/S) to 3 Giga-EI/S and mapped logarithmically. (This is merely a nonlimiting example.) Similarly, the horizontal axis 420 represents a different, second operational parameter of system section 240, such as, for example, relative volatile memory utilization on a normalized scale extending from 0% to a maximum value of 100% of capacity. In an alternate embodiment, the Param-2 axis 420 may represent RAM utilization in absolute terms, such as, for example, extending from 0 non-free GigaBytes to 8 non-free GigaBytes and mapped logarithmically. (This is merely a nonlimiting example.)

Circular region 401 represents a bounded area (synthesized area) within the 2D parameter space 400 of system section 240 in which Type “1” alarms are expected to issue based on extrapolation from a set of “included” sample points (ASP's) and “excluded” sample points (NASP's). When it starts to access the records of the annotated log 256, the post-process analytics module 257 does not yet know that region 401 will turn out to be a single circular bounded area with no voids in it. The Type “1” alarms including region 401 could instead turn out to be several spaced apart, bounded shapes with or without voids inside one or more of them. (The shape of the circular, unified, Type “1” alarms including region 401 is used here for simplicity's sake. In general, alarms including regions can have various shapes, including disjointed ones and void containing ones.) However, as the post-process analytics module 257 starts plotting-in (e.g., marking-in as sample point dots) Alarmed Sample Points (ASP's) that include a Type “1” alarm, the analytics module 257 will slowly start learning, based on the painted-in, “included” sample points (e.g., ASP_(01a), ASP_(01b), ASP_(12a), ASP_(12b), ASP_(01d), ASP_(01e), . . . , moving clockwise around the interior of the circular boundary of Type “1” alarmed region 401), that said boundary is a circular one. In addition to the Alarmed Sample Points (ASP's) that include the Type “1” alarm, there may be Non-Alarmed Sample Points (NASP's) in the first parameter space 400 for which no alarms are typically issued. In the illustrated example, the Non-Alarmed Sample Points include NASP_(00a), NASP_(00b), NASP_(00c), NASP_(00d), NASP_(00e), . . . , moving clockwise around the exterior of the circular boundary of Type “1” alarmed region 401, where the latter Non-Alarmed Sample Points (NASP's) indicate regions of the first parameter space 400 from which the Type “1” alarms are excluded. By automatically and repeatedly filling in ASP's and NASP's into the first parameter space 400 on the basis of newer records found in the automatically, repeatedly updated, annotated log 256 of system section 240, the corresponding analytics module 257 can adaptively learn the contours of the Type “1” alarms-including region 401.
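
A minimal sketch of such painting-in of sample points follows, assuming a two-parameter space as in FIG. 4A. The crude axis-aligned bounding estimate at the end merely stands in for whatever contour-learning technique an actual analytics module would apply:

# Hypothetical accumulation of Alarmed / Non-Alarmed Sample Points and a
# crude first estimate of an alarms-including region for one alarm type.
alarmed_points = {}       # alarm type -> list of (param1, param2) ASP's
non_alarmed_points = []   # list of (param1, param2) NASP's

def add_sample(param1, param2, alarm_types):
    if alarm_types:
        for alarm_type in alarm_types:
            alarmed_points.setdefault(alarm_type, []).append((param1, param2))
    else:
        non_alarmed_points.append((param1, param2))

def bounding_estimate(alarm_type):
    # Axis-aligned bounding box of the ASP's seen so far for this type.
    pts = alarmed_points.get(alarm_type, [])
    if not pts:
        return None
    xs, ys = zip(*pts)
    return (min(xs), max(xs)), (min(ys), max(ys))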

Similarly, the contours of the Type “2” alarms-including region 402 can be discerned over time as inner boundary hugging Alarmed Sample Points (ASP's) thereof appear. More specifically, as ASP_(02a), ASP_(02b), ASP_(02c), ASP_(123b), ASP_(123a), . . . and so on appear, moving clockwise around the interior of the circular boundary of Type “2” alarmed region 402, the outer contours of that region begin to emerge. Also, Non-Alarmed Sample Points such as NASP_(00b), NASP_(00f), NASP_(00c), . . . and so on begin to show parts of parameters space 400 from which the Type “2” alarms-including region 402 is excluded. In FIG. 4A, a notation of the form ASP_(n1, n2, . . . , nm) indicates that the corresponding sample point is one at which alarms of Types n1, n2, . . . , nm appeared. A zero in the ASP or NASP subscript number is merely a place holder meaning no correspondingly numbered alarm issued. More specifically, ASP_(123a) and ASP_(123b) respectively indicate sample points of space 400 where all of Type “1”, Type “2” and Type “3” alarms appeared. By the same token, ASP_(13a) and ASP_(13b) respectively indicate sample points of space 400 where each of Type “1” and Type “3” alarms appeared, but not Type “2”. Given this, it may be appreciated how sample points ASP_(13b) and ASP_(01f) help to delineate the interior and exterior, respectively, of the boundary for a third or Type “3” alarms-including region 403. In this example, Type “3” alarms may appear inside region 403. For example, all of ASP_(13a), ASP_(13b), ASP_(123a), ASP_(123b) indicate that a Type “3” alarm appeared at that respective sample point within the first parameters space 400.

The parameter axes (e.g., 410, 420) of the first parameters space 400 are not dictated by the alarm-input parameters (AIP's 151a′) used by the section's alarm generator 151′ for generating its alarms. Although FIG. 2 schematically shows each of alarm generator 151′ and events logger 155′ receiving its respective inputs (AIP's 151a′ and ESPI's 155a′) from a common section bus 145′ (or other form of interconnect fabric 145′), that does not mean the Event Sampled Parameter Inputs (e.g., ESPI's 155a′) are the same as the alarm-input parameters (AIP's 151a′). There could be overlap. However, it is not necessary.

Examples of Event Sampled Parameter Inputs (e.g., ESPI's 155a′) recorded by the events logger 155′ into the raw events log 156′ may include, but are not limited to: time or time slot of the event; day of the week (e.g., Monday, Tuesday, etc.); month of the year (e.g., Jan., Feb., etc.); portion of the local section where the recorded event is centered (e.g., CPU, RAM, . . . , Comm, Other) and so on. Additional input parameters (ESPI's 155a′) used by the events logging subsystem 155′ to update the raw (non-annotated) log 156′ may include: (a) current CPU utilization rate (preferably in absolute terms, e.g., instructions per second, rather than relative terms, e.g., percentage of maximum instructions per second); (b) current volatile memory utilization rate (preferably in absolute terms, e.g., bytes per second written and/or read); (c) current volatile memory filled, free and unusable space amounts (preferably in absolute terms, e.g., bytes free, bytes filled, bytes unusable, rather than relative terms, e.g., percentage of maximum capacity filled, free and marked as unusable); (d) current local backbone (145′) data transfer rates and error rates (preferably in absolute terms, e.g., bytes per second transferred and number of packets per second with correctable (ECC) errors and with noncorrectable errors); (e) current nonvolatile memory (e.g., Flash, disk) utilization rates (preferably in absolute terms); (f) current nonvolatile memory filled, free and unusable memory space amounts (preferably in absolute terms); (g) current external communication (147a′) data transfer rates and error rates (preferably in absolute terms); and so on. Axes of the automatically and repeatedly built and refined multi-dimensional map 400 (FIG. 4A) may further include a time of day axis (e.g., in 15 minute accumulative blocks), a day of week axis (e.g., Monday, Tuesday, etc.), a time of month axis (e.g., first week, second week, etc.), a time of year axis (e.g., day 107 of 365), a holiday or nonholiday axis, and indoor and outdoor weather condition axes (e.g., storms, hot, cold, etc.). Axes of the automatically and repeatedly built and refined multi-dimensional map may further include local location axes, such as ones indicating which of the wiring interconnects were involved in the event. One of the goals of the parameter spaces mapping functions of the post-process analytics module 257 may be to automatically determine which of the ESPI's 155a′ correlate as result-influencing and non-redundant parameters for observed alarms 250 and which do not.
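
One simple way such a relevance determination might be sketched, assuming the annotated log can be read back as per-event numeric parameter values plus a 0/1 alarmed flag, is a per-parameter correlation screen. The Pearson statistic here is only a stand-in for whatever statistical test an actual implementation applies:

# Hypothetical screen for result-influencing parameters: correlate each
# logged parameter with alarm occurrence across annotated-log records.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def influential_parameters(records, threshold=0.5):
    # records: list of dicts of numeric parameters plus "alarmed" (0 or 1)
    flags = [r["alarmed"] for r in records]
    scores = {}
    for name in records[0]:
        if name == "alarmed":
            continue
        r_val = pearson([rec[name] for rec in records], flags)
        if abs(r_val) >= threshold:      # keep only result-influencing ones
            scores[name] = r_val
    return scores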

Another goal of the post-process analytics module 257 may be to automatically form the mimicking version 261 of the local alarms generator 151′, where part of the forming step is determining which of the ESPI's 155a′ to use as drive inputs 261a for the mimicking alarms generator 261 (also referred to as the simulated (“sim”) alarms generator 261) and which not to bother using (e.g., because they are redundant or not among the result-influencing parameters). The utilized ESPI's are fed in as inputs 261a into the reconfigurable sim alarms generator 261. The utilized internal logic of the reconfigurable sim alarms generator 261 can include knowledge base rules similar to those used in (but hidden inside of) the local alarms generator 151′, an exception being that the internal logic of the sim alarms generator 261 is responsive to its chosen parameters 261a. The post-process analytics module 257 accesses the non-hidden internal logic of the sim alarms generator 261 by way of coupling 261b for changing that logic and/or analyzing it. A comparator 265 compares the alarms 250 output by the local alarms generator 151′ against the alarms 261o output by the sim alarms generator 261. The comparison results 266 inform the analytics module 257 of differences between the behaviors of the local alarms generator 151′ and the sim alarms generator 261. For example, if the visible internal logic of the sim alarms generator 261 (accessible via coupling 261b) does output an alarm (261o) while the hidden internal logic of the local alarms generator 151′ does not, that may indicate that there is an exception (e.g., a void in a Venn region of FIG. 4A) that the model inside the sim alarms generator 261 does not yet know about. The analytics module 257 uses the comparison results 266 to either update the sim alarms generator 261 or otherwise investigate why there is a difference in results. (It could be, for example, that the local alarms generator 151′ is in error and the sim alarms generator 261 is not.)
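
The sim-generator-plus-comparator arrangement might be sketched as follows, where the rule list inside the sim generator is fully visible (unlike the black box 151′). The single rule shown and its parameter name are assumptions of this sketch:

# Hypothetical comparison of real alarms (250) against the outputs of a
# simulated alarms generator (261) whose rule set is fully inspectable.
class SimAlarmsGenerator:
    def __init__(self, rules):
        self.rules = rules            # list of (predicate, alarm_type) pairs

    def generate(self, espi_inputs):
        return {alarm for pred, alarm in self.rules if pred(espi_inputs)}

def compare(real_alarms, sim_alarms):
    # Differences feed back into updating the sim model (or investigation).
    return {"missed_by_sim": real_alarms - sim_alarms,
            "extra_from_sim": sim_alarms - real_alarms}

sim = SimAlarmsGenerator([(lambda p: p["disk_io_mbps"] < 5, "LowDiskIO")])
print(compare({"LowDiskIO"}, sim.generate({"disk_io_mbps": 12.0})))
# -> {'missed_by_sim': {'LowDiskIO'}, 'extra_from_sim': set()}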

Over the course of time, as the post-process analytics module 257 automatically develops a better understanding of when and for what reasons alarms (e.g., 250, 261o) are generated for operations within its local system section 240, it may generate long term fixes 259 for how its local system section 240 operates. For example, one long term fix 259 may reconfigure how the local real-time resources management unit 152′ operates in response to alarms 250. Such a long term fix 259 may cause the real-time resources management unit 152′ to be more reactive or less reactive to certain kinds of alarms. Another long term fix 259 may reconfigure one or more of the local resources 141′-148′ of the local system section 240. For example, local operations controlling software in the local nonvolatile memory (e.g., 143′, 146′) might be reconfigured for proactively overcoming emerging anomaly trends.

Referring back to FIG. 4A, assume for example that region 404 of the first parameter space had until very recently been free of alarms. Then, suddenly, new alarms such as ASP_(04a) begin appearing in region 404. That may inform the local analytics module 257 that a new trend is emerging. Something has changed with respect to how its local system section 240 is being used. This emerged change in behavior is not routine and thus may be a situation of significant concern. Accordingly, through use of the annotated log 256 and historical mapping of routine ASP's in regions 401, 402, 403 (as an example), the local analytics module 257 can be alerted to emerging anomalies (e.g., ASP_(04a) of region 404) that may be of significance.

Referring back to FIG. 2, post-process analytics results (e.g., 291, 292) from separate system sections (e.g., 240, 260) may be merged together in a hierarchically organized manner (295, 297, etc.) to thereby develop a broader picture of emerging trends within the overall system 200. For example, and referring also to FIG. 4A, it could be that within the same time frame when new alarms (e.g., ASP_(04a)) begin emerging in previously quiet region 404 of the first parameter space 400 of section 240, similar new alarms (e.g., ASP_(09a)) begin emerging in a previously quiet region 409 of a second parameter space 440 associated with system section 260. The parent analytics module 295, which merges the analytics results (291, 292) of sections 240 and 260, may be programmed to detect the substantially cotemporaneous emergence 470 of new alarms in previously quiet parameter space regions 404 and 409. This may inform the parent analytics module 295 that something more wide-spread is developing within the system 200 as a whole and that the emerging trend is not confined merely to system section 240.

In FIG. 4A, the second parameter space 440 is understood to be that associated with the second system section 260 and to be a multi-dimensional mapping of its chosen parameters, for example parameter Param-1′ as enumerated along vertical axis 450 and parameter Param-2′ as represented along horizontal axis 460. At least some of the parameters (e.g., 450, 460) of the second parameter space 440 may be essentially the same as those (e.g., 410, 420) of the first parameter space 400 such that various mathematical operations may be carried out with them. For example, the sample point results of the two may be added together to produce a combined mapping of the alarmed and non-alarmed parameter space regions of the two system sections (240, 260) when considered in unison. As another example, portions of the two parameter spaces (400, 440) in which same type alarms overlap (e.g., 402 and 408) may be identified, and/or portions of the two parameter spaces where there is a lack of any overlap or cross correlation may be identified. The identifications of such areas of substantial sameness and of stark differences may allow the parent analytics module 295 to gain insight into the current states of the system sections (e.g., 240 and 260) which it is charged with analyzing. In one embodiment, a determination is automatically made as to which internal logic portions of the respective sim generators 261 and 281 are responsible for producing the alarms that are in the areas of substantial sameness for the two parameter spaces 400 and 440 and which are responsible for producing the alarms that are in the areas of substantial difference. A parent sim generator (not shown) may be synthesized for use within the hierarchical parent analytics module 295 for gaining post-process analytics insight for the system sections (e.g., 240 and 260) covered by the parent. Similar hierarchically organized structures and operations may be provided within the grandparent analytics module 297, where the latter combines the results of two or more parent analytics modules (only one shown: 295). By so hierarchically combining the post-process analytics results of various sections (e.g., 240, 260) of the overall system 200, a grander machine-implemented understanding of the operational states of the system 200 may be obtained, including, for example, identifying where in the system cotemporaneous and multi-sectional anomalous behaviors emerge, such as those 470 of regions 404 and 409 of FIG. 4A.
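
Under the assumption that the two sections share at least some parameter axes, the addition of sample point results and the identification of same-type overlap might be sketched with simple set operations:

# Hypothetical merger of two sections' sample-point sets over shared axes,
# plus detection of same-type alarm overlap between the two sections.
def merge_alarmed_points(section_a, section_b):
    # section_*: dict mapping alarm type -> set of (param1, param2) points
    merged = {}
    for space in (section_a, section_b):
        for alarm_type, pts in space.items():
            merged.setdefault(alarm_type, set()).update(pts)
    overlap = {t: section_a[t] & section_b[t]
               for t in section_a.keys() & section_b.keys()}
    return merged, overlap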

Referring to FIG. 4B, in one embodiment the respective sectional analytics modules (e.g., 257, 277) keep track of historical trends within their resource system sections (e.g., 240, 260), for example by way of maintaining histograms (e.g., 401a, 401b, 402a, 402c, 478a, 409d) of the alarms generating behaviors of their actual and/or simm'ed alarm generators relative to predefined subsections of the mapped, alarmed and non-alarmed parts of the respective parameter spaces (e.g., 400′ and 440′). More specifically, symbol 401a represents an alarms-add rate histogram for the white elliptical subregion below it among the subregions of Type “1” alarms region 401. The illustrated histogram 401a maintains rolling window statistics for its subregion as well as keeping track of more recent alarm additions within that subregion during a predefined and also rolling, short term temporal window. Thus, as indicated on the alarms-add rate histogram 401a, there may be a normal (e.g., statistical mean, median, etc.) “low” rate of alarm additions versus time, a normal “high” rate, and a more recent current rate, which in the case of histogram 401a is between the normal low and normal high rates. Thus the respective sectional analytics modules (e.g., 257) may automatically determine that there is no unusual, emerging trend currently developing in the subregion covered by histogram 401a. By contrast, exemplary histogram 402c is showing a currently above-normal-high additions rate (the rate at which alarms are generated for that portion of parameter space) for its respective subregion of Type “2” alarms region 402. As a result, the respective sectional analytics modules (e.g., 257) may automatically determine that unusual behavior is currently emerging in the subregion covered by histogram 402c. The specific reaction to the noted unusual behavior may vary based on application and may include waiting to see how long the unusual behavior (e.g., above normal high alarms rate) continues. Just as an above normal high alarms rate may be cause for concern, a below normal low alarms rate may be cause for concern. It may indicate that the corresponding subregion is being inefficiently underutilized. It may indicate that some of the resources associated with the underutilized subregion can be re-allocated to subregions (e.g., that of histogram 402c) exhibiting above normal high alarm rates.
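
A minimal sketch of one such per-subregion rate tracker follows, assuming alarms-added counts are accumulated per time slot. The rolling-window length and the min/max proxies for the normal “low” and “high” rates are assumptions of the sketch:

from collections import deque

class AlarmsAddRateTracker:
    # Hypothetical rolling-window tracker for one parameter-space subregion,
    # mirroring the "normal low / normal high / current" rates of FIG. 4B.
    def __init__(self, window=100):
        self.rates = deque(maxlen=window)   # alarms added per time slot

    def record_slot(self, alarms_added):
        self.rates.append(alarms_added)

    def status(self):
        if len(self.rates) < 2:
            return "insufficient history"
        history, current = list(self.rates)[:-1], self.rates[-1]
        normal_low, normal_high = min(history), max(history)
        if current > normal_high:
            return "above normal high"      # unusual behavior emerging
        if current < normal_low:
            return "below normal low"       # possible underutilization
        return "within normal range"        # no unusual, emerging trend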

Referring to FIG. 3, a machine-implemented method 300 for generating and using annotated logs is described.

Entry may be made at 305 into process step 310. In step 310, a corresponding section (e.g., 240, 260) of the overall system (e.g., 200) is allowed to run within a live, real time environment or a test simulation environment. The running section may be a data processing and/or data transmitting section and it may include one or more of the various section resources such as 141′-148′ illustrated for the case of FIG. 2. Additionally, the running section includes a local alarms generator such as 151′, a local events logging subsystem (e.g., 155′), a raw events log (e.g., 156′), a data concatenator (e.g., 251), a memory space for maintaining an annotated log (e.g., 256) and a post-process analytics module (e.g., 257) that is operatively coupled to the raw events log and to the memory space of the annotated log for the purpose of carrying out post-process analytics based on the data stored in at least one of the raw events log (e.g., 156′) and the annotated log (e.g., 256).

At subsequent process step 320, the time line is subdivided into predetermined slots (e.g., one or more for each periodic event logging record) and data regarding alarms generated during each time slot is concatenated to (or otherwise logically linked to) the data of a corresponding event logging record.

Reference is now made to FIG. 5, which depicts an exemplary data flow 500 that includes the formation of an annotated log 556/557. Line 540 represents a first time line (T1) along which respective alarm reports (e.g., 541, 545, 546, etc.) are output by a local alarms generator (e.g., 151′ of FIG. 2). Line 530 represents a second time line (T2) along which respective event log reports (e.g., 555) are output by a local events logging subsystem (e.g., 155′ of FIG. 2). The log time 531 of each respective event report (e.g., 555, only one shown) is generally not coincident with the T1 timings of associated alarm reports (e.g., 541, 545, 546, etc.). In one embodiment, an alarms-inclusion window (not shown) is defined about the event log time 531. The temporal length and phasing of the alarms-inclusion window (not shown) relative to the event log time 531 may vary. For example, the start and end points of the alarms-inclusion window may be equidistant from the event log time 531; or one of these points may be coincident with event log time 531; or the window may be displaced in another appropriate manner relative to the event log time 531. Alarm reports (e.g., one or more of 541, 545, 546, etc.) whose alarm times fall within the alarms-inclusion window (not shown) of an event report (e.g., 555) are deemed to be logically associated with that event report.

In one embodiment, the temporal length and phasing of the alarms-inclusion window (not shown) relative to the event log time 531 is made a function of system section and context, where for example temporal length is relatively long for a first section under a corresponding first contextual situation and temporal length is substantially smaller for a second section and under a corresponding second contextual situation. More specifically, there may be different drifts of clocks as between alarm generation and event logging in different sections and on different systems. There can be differences in cadence of logging in different logs of respective different system sections. The differences may be functions of system bandwidth and user utilization, of quality of signal transmission in different parts of the network, and so forth. In one embodiment, an expert knowledge database of rules is automatically consulted and used for setting temporal length and phasing of the alarms-inclusion window in each system section. An exemplary knowledge database rule might read: IF Usage_of_section_resources < Threshold_1 AND Day_of_Week = Weekend THEN alarms-inclusion_window.length = L1 AND alarms-inclusion_window.phase = 50%/50% ELSE IF Usage_of_section_resources > Threshold_1 AND Usage_of_section_resources < Threshold_2 AND Day_of_Week = Wednesday THEN alarms-inclusion_window.length = L2 AND alarms-inclusion_window.phase = 30%/70% ELSE IF Quality_of_Packets < Threshold_3 THEN . . . , where here L1 and L2 are respective predetermined constants indicative of window lengths, Threshold_1 and Threshold_2 are respective predetermined constants indicative of resource usage amounts, and Threshold_3 is a constant indicative of QoS for data communication packets. The rules in the knowledge database might alternatively or additionally include rules that are dependent on recent measures of clock drift or synchronization as such between different parts of the system.
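
A direct, hedged Python transliteration of the exemplary rule text follows; L1, L2 and the Threshold_* values remain the unspecified constants of the example, and the rules elided by the “. . .” above are left elided here as well:

# Hypothetical transliteration of the alarms-inclusion window rules above.
def window_settings(usage, day_of_week, packet_quality,
                    L1, L2, Threshold_1, Threshold_2, Threshold_3):
    if usage < Threshold_1 and day_of_week in ("Saturday", "Sunday"):
        return {"length": L1, "phase": (0.50, 0.50)}
    if Threshold_1 < usage < Threshold_2 and day_of_week == "Wednesday":
        return {"length": L2, "phase": (0.30, 0.70)}
    if packet_quality < Threshold_3:
        pass  # further rules elided in the example text (". . .")
    return None   # fall through to a default window (not specified above)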

In the illustrated example of FIG. 5, alarm report 541 is determined to be the one and only alarm report that falls within the alarms-inclusion window (not shown) of event report 555. There could have been none, or more than one. As seen, the alarm report 541 may include an alarm type field 542. The alarm type may be specified as a type number and/or more descriptively, for example as a “Low Disk I/O Throughput” alarm (see column 557b of the annotation portion 557). The alarm report 541 may include an alarm time field 543 that is indicative of a time point along T1 that is associated with the alarm report 541, for example when the alarm was output or when a system state associated with the alarm took place. The alarm report 541 may include an alarm location field 544 that is indicative of one or more locations within the system (e.g., 200 of FIG. 2) for which the alarm was issued. The indicated alarm location may simply identify the local system section (e.g., 240, 260) or it may provide more detailed information about the location involved with the alarmed situation (e.g., disk drive A544 of disk bank B543).

In accordance with one machine-implemented method of the present disclosure, for each detected event report (e.g., 555), its alarms-inclusion window (not shown; corresponds to time 531) is determined and alarm reports (e.g., 541) which fall within that alarms-inclusion window are identified by a time slot matching unit 551. More specifically, the time slot matching unit 551 fetches (555a) the log time of the given event report 555, determines the associated alarms-inclusion window, tests for alarms along time line T1 (540) that are within that alarms-inclusion window, fetches the alarm time (e.g., 543) of each such included alarm report and records the fetched alarm time in a column (e.g., 557a) of a being-formed annotation portion 557. For sake of simplicity, it is assumed that there is only one alarm and one alarm time (e.g., Jun. 12, 2014, 9:14:44 AM) associated with the topmost event record (e.g., Jun. 12, 2014, 9:14:47 AM) of the logged events portion 556. As indicated elsewhere, there could be many alarms and they could be of same or different types. In the case of the exemplary topmost event record, there is one “Low Disk I/O Throughput” alarm and thus the matrix cell for that row and for column 557b (Alarms of the Type: Low Disk I/O Throughput) is marked “True” (or alternatively as an Alarm Sample Point, ASP, included here). Since there are no other alarms for this example, the remaining cells in the row are marked “False” (or alternatively as “excluded”, or as a No-Alarm Sample Point, NASP, included here). In corresponding FIG. 4A, an ASP dot is added, for example, to the Type “1” region 401 if that is the region of the Low Disk I/O Throughput alarms. The location of the added ASP dot in the multi-dimensional parameter space 400 is a multi-dimensional one. As more and more such ASP dots are added, one or more multi-dimensional, bounded shapes (e.g., spheroid 401) may become apparent as being associated with such Type “1” alarms.
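
The time slot matching performed by unit 551 might be sketched as follows, assuming event and alarm times are comparable numeric timestamps and assuming, purely for illustration, a symmetric alarms-inclusion window:

# Hypothetical time-slot matching: find alarm reports whose alarm times
# fall within the alarms-inclusion window around an event's log time.
def match_alarms_to_event(event_time, alarm_reports, half_window_s=5.0):
    lo, hi = event_time - half_window_s, event_time + half_window_s
    return [a for a in alarm_reports if lo <= a["time"] <= hi]

alarms = [{"time": 104.0, "type": "Low Disk I/O Throughput"},
          {"time": 250.0, "type": "High CPU"}]
print(match_alarms_to_event(107.0, alarms))
# -> [{'time': 104.0, 'type': 'Low Disk I/O Throughput'}]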

For each detected alarm (e.g., 541) within the alarms-inclusion window of a respective event record, an alarm type matching/adding unit 547 automatically fetches (550 a) the alarm type 542 and searches among the alarm-type columns (e.g., 557 b, 557 c, 557 d, etc.) of the forming annotation portion 557 for a matching one. If a match is found, a corresponding True (or ASP here) indication is recorded in the respective matrix cell. If a match is not found, a new column adder function 548 is activated, a corresponding new column (e.g., 557 e, not shown) is added to the forming annotation portion 557 and a corresponding True (or ASP here) indication is recorded in the respective matrix cell of the newly added column. Thus the annotation portion 557 grows in size and complexity as new alarm types are encountered. It is to be understood that at some point, when a predetermined threshold for the allowed number of detailed rows is reached, the information of older rows is summarized and stored in a rolling window of older statistics while the older detailed rows are freed by garbage collection to make room for newer detailed rows.
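
The matching/adding behavior could be sketched as follows, with a dict of columns standing in for the annotation portion 557 (record_alarms and the data layout are hypothetical, not the disclosure's implementation):

    def record_alarms(annotation, matched_alarms):
        """annotation: dict mapping alarm-type column name -> list of booleans,
        one entry per event row (the forming annotation portion 557)."""
        n_rows = len(next(iter(annotation.values()), []))
        for col in annotation.values():
            col.append(False)                        # default for the new row: NASP
        for alarm in matched_alarms:
            if alarm.alarm_type not in annotation:   # no matching column found:
                # the new-column-adder function (548) at work
                annotation[alarm.alarm_type] = [False] * (n_rows + 1)
            annotation[alarm.alarm_type][n_rows] = True   # mark True / ASP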

Returning to step 330 of FIG. 3, as the respective system section (e.g., 240, 260) continues to execute within a real live environment or a simulated one, the corresponding annotated log (e.g., 256, 276) continues to build in terms of number of details and number of ASP's added to the corresponding parameter space (e.g., 400, 440 of FIG. 4A). Events that have no alarms become added NASP points.

Referring to step 340, after the annotated log (e.g., 256, 276) has been built up to a size of sufficient utility (where such size may vary from one system section to another), the local post-process analytics module (e.g., 257, 277) fetches the current build and, in subsequent step 350, uses it to perform alarms-aware analysis of the current state of its system section and adaptive learning about how the local alarms generator (e.g., 151′, 171′ of FIG. 2) appears to be working, while building or updating its accessible sim version (e.g., 261, 281) of an alarms generator and comparing (265, 285) outputs of the two so as to develop insight regarding where the outputs of the real-time alarms generator (e.g., 151′, 171′) and of the sim version (e.g., 261, 281) differ, and why, and what if anything should be done in response to detected differences (266, 286).

At juncture point 360, the process 300 has a number of options which are not mutually exclusive (more than one can be carried out in substantially the same time period). One of the options is to simply return (363) to step 310 by way of path 315 and continue to run the section, build the annotated log (step 330) and study it some more (step 350).

Another option 364 is to take part or all of the current analytics for the local system section and forward 365 the gathered analytics to a hierarchical parent analytics section (e.g., 295) which performs a hierarchically higher level of analytics on the results of two or more system sections. Step 367 represents a using by the hierarchical parent analytics section (e.g., 295) of the forwarded sectional analytics and a making of one or more adaptive changes at the super-sectional level. Yet another option, represented by 361/362, is to use the currently developed set of local analytics for making long term changes (hopefully, performance improving changes) to resources of the local system section. One of those changes can include reconfiguring the sim alarms generator (e.g., 261, 281) to more accurately mimic the behavior of the real-time alarms generator. Another of those changes can include reconfiguring other resources of the local system section so as to reduce those of the parameters that appear to be the main drivers behind excessive numbers and/or frequencies of certain kinds of alarms. For example, in the case of column 557 b of FIG. 5, a driving cause of the low disk I/O throughput might be that a certain thread of executing software is holding up other threads as the first thread waits for slow results. A solution might be to cache some of the disk data into system RAM so it may be more quickly accessible. Yet more specifically, the local post-process analytics module (e.g., 257) might scan down column 556 c of the event log portion of the annotated log 556/557 and discover that low RAM utilization cross correlates with low disk I/O throughput in column 557 b while high RAM utilization cross correlates with Non-Alarmed sample points (NASP's) down column 557 b. The post-process analytics module (e.g., 257) might then deduce (using appropriate artificial intelligence techniques such as expert knowledge base techniques) that the problem might be solved by urging the system section dynamics into a mode that more often has high RAM utilization. This, of course, is merely an example. In actual practice, the cross correlation patterns between alarms (portion 557) and parameters (portion 556) might be more complex than the given example.
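
A minimal sketch of the kind of column-wise cross correlation described here follows, assuming the annotated log is held as parallel lists. The mean-difference statistic used below is my illustrative choice, not one mandated by the disclosure, and alarm_parameter_correlation is a hypothetical name.

    from statistics import mean

    def alarm_parameter_correlation(param_col, alarm_col):
        """Compare a logged parameter column (e.g., RAM utilization, 556 c)
        against an alarm indicator column (e.g., 557 b). Returns the mean of
        the parameter on ASP rows minus its mean on NASP rows; a strongly
        negative value matches the low-RAM <-> low-disk-I/O example."""
        asp_vals = [p for p, alarmed in zip(param_col, alarm_col) if alarmed]
        nasp_vals = [p for p, alarmed in zip(param_col, alarm_col) if not alarmed]
        if not asp_vals or not nasp_vals:
            return 0.0  # not enough evidence either way
        return mean(asp_vals) - mean(nasp_vals)

    # Hypothetical data echoing the text's example: low RAM utilization on
    # alarmed rows, high on non-alarmed rows, yields a clearly negative score.
    ram = [0.2, 0.9, 0.25, 0.85, 0.8]
    low_disk_alarm = [True, False, True, False, False]
    print(alarm_parameter_correlation(ram, low_disk_alarm))  # -0.625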

Subsequent to steps 361 and/or 367, path 368 is taken back to step 305 by way of path 315. The system (e.g., 200) keeps exercising its various sections, building annotated logs for the respective sections, building hierarchical parent level annotated logs for respective groups of sections, performing post-process analytics for the respective sections, super-sections and even hierarchically higher up grandparent and so-on sections, and making appropriate changes based on the various analytics results.

Referring briefly back to FIG. 5, one of the local changes is that of automatically determining which parameters of the local parameter space are useful for delineating between alarmed behaviors and which are not. Cross correlation analysis may be used for identifying parameters that correlate poorly relative to regions of different alarm types (e.g., 401, 402 of FIG. 4A). A parameters selection multiplexor 559 may be operated to weed out the poorly correlating parameters. Comparator 565 provides feedback to the post-process analytics unit 560 so that the latter can also adjust the internal logic of the accessible sim generator 561.
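
Continuing the sketch above, the weeding-out performed by the parameters selection multiplexor 559 might take a form like the following (select_parameters and the 0.1 cutoff are hypothetical; at least one alarm column is assumed):

    def select_parameters(param_cols, alarm_cols, min_score=0.1):
        """param_cols: dict name -> list of values; alarm_cols: dict name ->
        list of booleans. Keep only parameters whose absolute mean-difference
        score against at least one alarm type exceeds min_score."""
        keep = []
        for name, values in param_cols.items():
            best = max(abs(alarm_parameter_correlation(values, a))
                       for a in alarm_cols.values())
            if best >= min_score:
                keep.append(name)  # parameter helps delineate some alarm type
        return keep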

Referring to FIG. 6, shown is a machine-implemented method 600 for identifying localized and more widespread emerging problems within a multi-device, multi-sectional system such as 200 of FIG. 2. Entry is made at 605.

In step 610, the annotated logs of the respective system sections are individually built up as each section of the executing system performs its assigned tasks. In step 620, atypical changes of ASP additions to respective parameter spaces (e.g., 400, 440) are searched for and noted. The atypical changes of Alarmed Sample Point (ASP) additions may include one or both of additions to regions (e.g., 404, 409) of parameter space that previously did not have ASP's in them and abnormal changes to the rate of ASP additions in various subsections (see again, FIG. 4B). Moreover, within step 620, a determination is made as to whether the noted atypical changes are localized to one system section or confined to one type of system section, or are more widespread and cotemporaneous so as to indicate a system-wide problem.
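
One plausible reading of the two step-620 tests, as a sketch: the grid discretization of parameter space and the 3x rate heuristic below are assumptions for illustration, not prescribed by the disclosure.

    def grid_cell(point, cell=0.1):
        """Discretize a multi-dimensional sample point into a coarse grid cell,
        standing in for the bounded regions (e.g., 401, 404) of FIG. 4A."""
        return tuple(round(coord / cell) for coord in point)

    def atypical_asp_changes(old_asps, new_asps, old_span_s, new_span_s):
        """Flag (a) ASP's landing in cells that had no prior ASP's, and (b) an
        abnormal jump in the overall ASP addition rate (here: >3x the old rate)."""
        known = {grid_cell(p) for p in old_asps}
        new_region_points = [p for p in new_asps if grid_cell(p) not in known]
        old_rate = len(old_asps) / old_span_s
        new_rate = len(new_asps) / new_span_s
        rate_abnormal = old_rate > 0 and new_rate > 3 * old_rate
        return new_region_points, rate_abnormal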

If the noted atypical changes are localized, then step 622 illustrates a generalized step for ameliorating the emerging problem, namely, re-allocating resources (e.g., 141′-148′) of the local section (e.g., 240) in a manner which typically reduces the noted and emerging problem. Step 631 depicts a more specific example: IF the storage access rate per storage unit in the section is too high relative to a predetermined threshold, THEN add one or more additional storage units (e.g., volatile or nonvolatile) to the affected section. Step 632 depicts another specific example: IF the storage access rate per storage unit in the local section is too low relative to a predetermined threshold, THEN combine the data of two or more storage units into one and move the freed storage unit of the section to a freed resources pool of the system. Yet another possible solution is depicted in step 635: IF the data processing rate per DP unit is too high relative to a predetermined threshold, THEN add one or more additional DP units to the local section and re-assign some of the section tasks previously routed to the previously present DP units to the newly added DP units. A further option is shown in step 636: IF the data processing rate per DP unit is too low relative to a predetermined threshold, THEN move the tasks of one DP unit to another of the same section and move the task-freed section DP unit to a freed resources pool of the system. Similarly, steps 637-638 depict too-high and too-low solutions for atypical data transfer rates involving the COMM resources of the local system section. Symbol 640 represents yet more of similar solutions for other resources of the affected section. Step 645 represents an automatically repeated search for cross correlations between event parameters and non-routine alarm occurrences in the annotated logs of the local sections. One example was given for columns 556 c and 557 b of FIG. 5, where it turned out that low RAM utilization correlated with low disk throughput (merely as a hypothetical example). Path 650 loops the process back to step 610 for automatic repetition.
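
The paired too-high/too-low rules of steps 631-638 share a common shape, sketched generically below; the pool interface (allocate, release) and the function name rebalance are hypothetical stand-ins for whatever freed-resources-pool mechanism the system provides.

    def rebalance(section_units, rate_per_unit, high, low, pool, kind):
        """Generic form of steps 631-638: one too-high/too-low rule pair per
        resource kind (storage units, DP units, COMM units). pool is assumed
        to expose allocate(kind) and release(kind, unit)."""
        if rate_per_unit > high:                          # e.g., steps 631, 635, 637
            unit = pool.allocate(kind)                    # draw from freed resources
            if unit is not None:
                section_units.append(unit)                # re-route some tasks to it
        elif rate_per_unit < low and len(section_units) > 1:  # steps 632, 636, 638
            freed = section_units.pop()                   # consolidate onto fewer units
            pool.release(kind, freed)                     # return unit to the freed pool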

If the noted atypical changes are determined in step 620 to be more widespread, then step 625 illustrates a generalized step for ameliorating the emerging problem, namely, determining the source of the emerging widespread problem based on its being widespread. For example, a key communications fabric of the system may be experiencing problems that cannot be corrected with normal quick-fix solutions. The system might be subject to a widespread denial-of-service attack. Step 651 depicts the use of an expert knowledge base system to identify the likely causes and likely best solutions for such emerging widespread problems based on their being widespread and simultaneously affecting some regions of parameter space but not others.

Step 655 represents an automatically repeated search for cross correlations between event parameters and non-routine alarm occurrences in the annotated logs of system super-sections (e.g., 295, 297). One example might be that communication data transfer rates (COMM rates) are unusually low for certain kinds of system sections and the problem correlates to a time range in which a newly installed communications control software package becomes activated. This is given merely as a hypothetical example. Path 650 loops the process back to step 610 for automatic repetition.

The present disclosure is to be taken as illustrative rather than as limiting the scope, nature, or spirit of the present teachings. Numerous modifications and variations will become apparent to those skilled in the art after studying the disclosure, including use of equivalent functional and/or structural substitutes for elements described herein, use of equivalent functional couplings for couplings described herein, and/or use of equivalent functional steps for steps described herein. Such insubstantial variations are to be considered within the scope of what is contemplated and taught here. Moreover, if plural examples are given for specific means, or steps, and extrapolation between and/or beyond such given examples is obvious in view of the present disclosure, then the disclosure is to be deemed as effectively disclosing and thus covering at least such extrapolations.

Further, the functionalities described herein may be implemented entirely and non-abstractly as physical hardware, entirely as physical non-abstract software (including firmware, resident software, micro-code, etc.) or by combining non-abstract software and hardware implementations that may all generally be referred to herein as a "circuit," "module," "component," "block," "database," "agent" or "system." Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more non-ephemeral computer readable media having computer readable and/or executable program code embodied thereon.

Any combination of one or more computer readable media may be utilized. The computer readable media may be a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an appropriate electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider), or in a cloud computing environment, or offered as a service such as a Software as a Service (SaaS).

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that, when executed, can direct/program a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions when stored in the computer readable medium produce an article of manufacture including instructions which, when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or limiting to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the disclosed technology and its practical application, to thereby enable others skilled in the art to best utilize the technology in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope be defined by the claims appended hereto.

1.-20. (canceled)
21. A machine-implemented method comprising: (a) automatically first determining whether newly emerging first non-routine anomalies are developing within a first section of a data processing system having at least first and second sections, each of the first and second sections including a respective behaviors logging subsystem configured to automatically log monitored behaviors within the respective section and a respective section alarming subsystem configured to automatically generate alarms for alarm worthy events within the respective section, the first determining including automatically repeatedly building a first annotated log for the first section, the first annotated log providing logical co-associations between logged behaviors produced by the respective behaviors logging subsystem of the first section and contemporaneously generated alarms generated by the respective section alarming subsystem of the first section; (b) automatically second determining whether newly emerging second non-routine anomalies are developing within the second section of the data processing system, the second determining including automatically repeatedly building a second annotated log for the second section, the second annotated log providing logical co-associations between logged behaviors produced by the respective behaviors logging subsystem of the second section and contemporaneously generated alarms generated by the respective section alarming subsystem of the second section; (c) automatically third determining from the first and second determinings whether the first and second non-routine anomalies develop within a same specified time frame; and (d) in response to the third determining indicating development within the same specified time frame of the first and second non-routine anomalies, automatically identifying the first and second sections as locations in the data processing system where cotemporaneous and multi-sectional non-routine anomalies are emerging and automatically generating an alarm indicating that the emerging multi-sectional non-routine anomalies constitute a more widespread problem than just anomalous behaviors in the first and second sections individually.
22. The method of claim 21 wherein: each of the first and second sections further comprises respective locally intercoupled resources including one or more local data processing units and one or more local data storage units; and each of the first and second automatic determinings of whether newly emerging non-routine anomalies are developing comprises, for each of the respective sections, using the respective annotated log to automatically repeatedly map into a respective anomalies versus parameters mapping space, sample point indicators indicative of respective coordinates in the mapping space corresponding to plural parameters associated with each generating and non-generating of alarms by the respective section alarming subsystem and corresponding to the temporally co-associated, recently logged behaviors of the respective section.
23. The method of claim 22 wherein: at least one of the parameters in a respective anomalies versus parameters mapping space of a respective section represents a performance metric of at least one of the local data processing units and/or the local data storage units of the respective section.
24. The method of claim 22 wherein: the automatically repeated mapping of the sample point indicators into a respective anomalies versus parameters mapping space includes mapping as alarmed sample points (ASP's) entries in the respective annotated log for which both an alarm was generated by the respective section alarming subsystem and one or more cotemporaneous events were logged by the respective behaviors logging subsystem.

25. The method of claim 24 wherein: the automatically repeated mapping of the sample point indicators into a respective anomalies versus parameters mapping space includes mapping as non-alarmed sample points (NASP's) entries in the respective annotated log for which an alarm was not generated by the respective section alarming subsystem and one or more cotemporaneous events were logged by the respective behaviors logging subsystem.
26. The method of claim 25 wherein: the automatic first and second determinings of whether respective newly emerging non-routine anomalies are developing within the respective section of the data processing system respectively include classifying regions in the respective anomalies versus parameters mapping space populated by NASP's as regions in which ASP's do not routinely occur.
27. The method of claim 26 wherein: the automatic first and second determinings of whether respective newly emerging non-routine anomalies are developing within the respective section of the data processing system respectively include identifying as newly emerging non-routine anomalies those ASP's that map into a region previously classified as one in which ASP's do not routinely occur.
28. The method of claim 21 and further comprising: using at least one of the respective annotated logs of the respective first and second sections for creating at least one of respective behavior mimicking models of the first and second section alarming subsystems, the created at least one of the respective behavior mimicking models having accessible internal logic structures configured to mimic output behaviors of the corresponding at least one of the first and second section alarming subsystems; for a specified time period during the running of the at least one of the first and second sections, comparing alarms generated by the created at least one of the respective behavior mimicking models with alarms generated by the respective at least one of the first and second section alarming subsystems; in response to detection of a difference by said comparing step, modifying the respective internal logic structures of the corresponding at least one of the respective behavior mimicking models so as to reduce the difference in subsequent time periods; and automatically repeating said comparing and modifying steps for the subsequent time periods.
29. The method of claim 28 wherein: the modifying step includes changing a subset of input parameters that the at least one of the respective behavior mimicking models uses as its input parameters; and the changing of the subset of input parameters is responsive to automatically repeated updates made to the corresponding at least one of the respective annotated logs of the respective first and second sections.
30. The method of claim 22 wherein: the respective anomalies versus parameters mapping spaces of the first and second sections are respectively defined in a database storing corresponding first and second data representing alarmed sample points (ASP's) of the first and second sections as points within corresponding first and second multi-parameter coordinate spaces where each respective alarmed sample point (ASP) of the first and second sections respectively correlates to one or more of the temporally corresponding generatings of alarms by the corresponding one of the first and second section alarming subsystems; and at least one of the parameter axes of the anomalies versus parameters second mapping space corresponds to one of the parameter axes of the anomalies versus parameters first mapping space such that co-emergence within said same specified time frame of respective newly emerging non-routine anomalies of the first and second sections can be cross-correlated to one another as mapped along each of the corresponding parameter axes of the first and second mapping spaces.

31. The method of claim 30 wherein: the anomalies versus parameters mapping spaces are respectively further defined in the database by stored second data representing non-alarmed sample points (NASP's) of the first and second sections as points within the respective first and second multi-parameter coordinate spaces where each non-alarmed sample point (NASP) correlates to an event logging time when there are no temporally corresponding generatings of alarms by the respective one of the first and second section alarming subsystems.
32. The method of claim 22 wherein: in addition to its respective one or more local data processing units and its respective one or more local data storage units, at least one of the first and second sections includes a corresponding data input/output communicating unit; and the recently logged behaviors of the respective generated log of the at least one of the first and second sections include a data processing rate of at least one of the respective local data processing units of the respective section, a data access rate of at least one of the respective local data storage units of the respective section and a data communicating rate of the corresponding data input/output communicating unit of the respective section.
33. The method of claim 21 and further comprising: automatically repeatedly searching for cross correlations between event parameters and non-routine alarm occurrences in the respective annotated logs of the first and second sections.
34. The method of claim 33 and further comprising: building a knowledge database based on found cross correlations between event parameters and non-routine alarm occurrences in the respective annotated logs of the first and second sections.
35. The method of claim 28 wherein the data processing system has a hierarchical structure composed of plural parent sections and respective sections within the parent sections, the first and second sections belonging to a first parent section, the method further comprising: running a third section of a second parent section within the data processing system where the running third section includes as its respective section alarming subsystem, a third section alarming subsystem and includes as its respective section behaviors logging subsystem, a third section behaviors logging subsystem, the third section alarming subsystem being configured to generate alarms for non-catastrophic alarm-worthy events detected within the third section, the third section behaviors logging subsystem being configured to generate a log of monitored behaviors within the third section; logically co-associating recently logged behaviors of the generated log produced by the third section behaviors logging subsystem with substantially cotemporaneous alarms generated by the third section alarming subsystem; building a third annotated log comprised of the logically co-associated logged behaviors and the substantially cotemporaneous alarms of the third section; using the third annotated log of the respective third section to create a corresponding third behavior mimicking model of the third section alarming subsystem, the created third behavior mimicking model having accessible internal logic structures configured to mimic output behaviors of the third section alarming subsystem; for the specified time frame and during the running of the third section, comparing alarms generated by the created third behavior mimicking model with alarms generated by the corresponding third section alarming subsystem; in response to detection of differences by said comparing step for the third section, modifying the respective internal logic structures of the corresponding third behavior mimicking model so as to reduce future differences; automatically repeating said comparing and modifying steps for subsequent time frames for the third behavior mimicking model; and building a knowledge database over said subsequent time frames where the over-time built knowledge database provides insights as to operations of the third section alarming subsystem of the second parent section based on access to the accessible internal logic structures of the corresponding third behavior mimicking model.
36. The method of claim 21 and further comprising: merging the annotated logs of the sections that represent hierarchical children of a first parent section of the data processing system to thereby form a first parent annotated log; merging the annotated logs of the sections that represent hierarchical children of a second parent section of the data processing system to thereby form a second parent annotated log; and automatically repeatedly searching for cross correlations between event parameters and non-routine alarm occurrences in the respective first and second parent annotated logs of the first and second parent sections.
37. The method of claim 36 and further comprising: building a knowledge database based on found cross correlations between event parameters and non-routine alarm occurrences in the respective first and second parent annotated logs of the first and second parent sections.
38. A machine-implemented method of developing behavior mimicking and internals-accessible models of respective pre-configured alarming subsystems of respective sections of a data processing system having a hierarchical structure composed of plural parent sections and respective sections within the parent sections, wherein each section of a respective parent section comprises locally intercoupled resources including one or more local data processing units and one or more local data storage units, at least a respective one of the sections further comprising a respective section behaviors logging subsystem configured to automatically log monitored behaviors within the respective section and a respective section alarming subsystem configured to automatically generate alarms for alarm worthy events within the respective section, the respective alarming subsystem not necessarily having internals that are easily accessible for determining why the respective alarming subsystem did or did not generate an alarm for a given event within the respective section, the method comprising: running a first section of a respective first parent section within the data processing system where the running first section includes a first section alarming subsystem as its respective section alarming subsystem and includes a first section behaviors logging subsystem as its respective section behaviors logging subsystem, the first section alarming subsystem being configured to generate alarms for non-catastrophic alarm-worthy events detected within the first section, the first section behaviors logging subsystem being configured to generate a log of monitored behaviors within the first section; logically co-associating recently logged behaviors of the generated log produced by the first section behaviors logging subsystem with substantially cotemporaneous alarms generated by the first section alarming subsystem; building a first annotated log comprised of the logically co-associated logged behaviors and the substantially cotemporaneous alarms of the first section; using the first annotated log of the respective first section to create a corresponding first behavior mimicking model of the first section alarming subsystem, the created first behavior mimicking model having accessible internal logic structures configured to mimic output behaviors of the first section alarming subsystem, the accessible internal logic structures being configured to allow for determining why the first behavior mimicking model did or did not generate an alarm for a given event within the first section; for a specified time frame during the running of the first section, comparing alarms generated by the created first behavior mimicking model with alarms generated by the corresponding first section alarming subsystem; in response to detection of a difference by said comparing step, modifying the respective internal logic structures of the corresponding first behavior mimicking model so as to reduce future differences; automatically repeating said comparing and modifying steps for subsequent time frames; and building a knowledge database over said subsequent time frames where the over-time built knowledge database provides insights as to operations of the first section alarming subsystem based on access to the accessible internal logic structures of the corresponding first behavior mimicking model.
39. The method of claim 38 and further comprising: concurrently running a second section of the first parent section within the data processing system where the running second section includes a second section alarming subsystem as its respective section alarming subsystem, and includes a second section behaviors logging subsystem as its respective section behaviors logging subsystem, the second section alarming subsystem being configured to generate alarms for non-catastrophic alarm-worthy events detected within the second section, the second section behaviors logging subsystem being configured to generate a log of monitored behaviors within the second section; logically co-associating recently logged behaviors of the generated log produced by the second section behaviors logging subsystem with substantially cotemporaneous alarms generated by the second section alarming subsystem; building a second annotated log comprised of the logically co-associated logged behaviors and the substantially cotemporaneous alarms of the second section; using the second annotated log of the respective second section to create a corresponding second behavior mimicking model of the second section alarming subsystem, the created second behavior mimicking model having accessible internal logic structures configured to mimic output behaviors of the second section alarming subsystem; for a specified time frame during the running of the second section, respectively comparing alarms generated by the created second behavior mimicking model with alarms generated by the corresponding second section alarming subsystem; in response to detection of a corresponding difference by said respective comparing step, modifying the respective internal logic structures of the corresponding second behavior mimicking model so as to reduce future differences; automatically repeating said respective comparing and modifying steps for subsequent time frames for the second behavior mimicking model; and building a knowledge database over said subsequent time frames where the over-time built knowledge database provides insights as to operations of the second section alarming subsystem based on access to the accessible internal logic structures of the corresponding second behavior mimicking model.
40. A data processing system configured to deal with emerging non-routine anomalies within one or more of plural sections of the data processing system, the emerging non-routine anomalies developing in one or the other of localized portions of the data processing system or on a more widespread basis and not being catastrophic failures, the data processing system being subdivided into a plurality of parent sections with each parent section comprising respective plural sections, each section having locally intercoupled resources including one or more local data processing units and one or more local data storage units, wherein at least one respective section of a respective two or more of the plural parent sections each respectively includes a respective section behaviors logging subsystem configured to automatically log monitored behaviors within the respective section and to generate a respective local log and each of the at least one respective sections respectively includes a respective section alarming subsystem configured to automatically generate alarms for alarm worthy events within the respective section, the data processing system further comprising: an annotated logs storing database storing one or more respective annotated logs that respectively indicate correlations for respective ones of the system sections between recently logged behaviors of the respective system sections as recently recorded in the respective local logs of the respective sections and temporally correlated generatings and non-generatings of alarms by the respective section alarming subsystems of the respective system sections; an annotated logs builder, coupled to the database and configured to automatically repeatedly, for respective ones of the sections, add to the respective stored and annotated logs of the respective sections additional samples of temporal correlations between recently logged behaviors logged in the respective local logs and temporally corresponding generatings and non-generatings of alarms by the respective section alarming subsystems of the respective sections; and a post-process analytics portion of the data processing system that is operatively coupled to respective ones of the annotated logs stored in the database for the respective sections and is configured to automatically repeatedly map into respective anomalies versus parameters mapping spaces of respective ones of the system sections, sample point indicators indicative of respective coordinates in the respective mapping space corresponding to plural parameters associated with each generating and non-generating of alarms by the respective section alarming subsystem of the respective sections and corresponding to temporally correlated, recently logged behaviors of the respective local log produced by the section behaviors logging subsystem of that respective section; wherein the post-process analytics portion is configured to flag out abnormal changes over time in the automatically repeatedly made mappings of the sample point indicators into the respective anomalies versus parameters mapping spaces, where the flagged out abnormal changes include those representing emerging non-routine anomalies that are not catastrophic failures; and wherein the post-process analytics portion is configured to flag out concurrent development for two or more respective sections within one or within plural ones of the parent sections of respective newly emerging non-routine anomalies where such concurrent development for the two or more respective sections is indicative of emergence of non-routine anomalies on a more widespread basis than just separately in individualized ones of the sections.